  1[[tags: manual]]
  2[[toc:]]
  3
  4== Module (chicken irregex)
  5
  6This module provides support for regular expressions, using the
  7powerful ''irregex'' regular expression engine by Alex Shinn.  It
supports POSIX syntax with various (irregular) PCRE extensions,
  9as well as SCSH's SRE syntax, with various aliases for commonly used
 10patterns.  DFA matching is used when possible, otherwise a
 11closure-compiled NFA approach is used.  Matching may be performed over
 12standard Scheme strings, or over arbitrarily chunked streams of
 13strings.
 14
 15On systems that support dynamic loading, the {{irregex}} module can be
 16made available in the CHICKEN interpreter ({{csi}}) by entering
 17
 18<enscript highlight=scheme>
 19(import (chicken irregex))
 20</enscript>
 21
 22=== Procedures
 23
 24==== irregex
 25==== string->irregex
 26==== sre->irregex
 27
 28<procedure>(irregex <posix-string-or-sre> [<options> ...])</procedure><br>
 29<procedure>(string->irregex <posix-string> [<options> ...])</procedure><br>
 30<procedure>(sre->irregex <sre> [<options> ...])</procedure><br>
 31
 32Compiles a regular expression from either a POSIX-style regular
 33expression string (with most PCRE extensions) or an SCSH-style SRE.
 34There is no {{(rx ...)}} syntax - just use normal Scheme lists, with
 35{{quasiquote}} if you like.
 36
 37Technically a string by itself could be considered a valid (though
 38rather silly) SRE, so if you want to just match a literal string you
 39should use something like {{(irregex `(: ,str))}}, or use the explicit
 40{{(sre->irregex str)}}.
 41
 42The options are a list of any of the following symbols:
 43
 44; {{'i}}, {{'case-insensitive}} : match case-insensitively
; {{'m}}, {{'multi-line}} : treat string as multiple lines (affects {{^}} and {{$}})
 46; {{'s}}, {{'single-line}} : treat string as a single line ({{.}} can match newline)
 47; {{'utf8}} : utf8-mode (assumes strings are byte-strings)
 48; {{'fast}} : try to optimize the regular expression
 49; {{'small}} : try to compile a smaller regular expression
 50; {{'backtrack}} : enforce a backtracking implementation
 51
 52The {{'fast}} and {{'small}} options are heuristic guidelines and will
 53not necessarily make the compiled expression faster or smaller.
 54
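The difference between a POSIX string and a literal string SRE can be sketched as follows.  This is a minimal illustration, not a definitive recipe; the names {{greeting-rx}} and {{dotted}} are made up for the example.

<enscript highlight=scheme>
(import (chicken irregex))

;; Compile once with the case-insensitive option and reuse the result.
(define greeting-rx (irregex "hello" 'i))
(irregex? greeting-rx)                        ; => #t
(irregex-match? greeting-rx "HeLLo")          ; => #t

;; A string passed to irregex is parsed as a POSIX pattern,
;; so "." acts as a wildcard ...
(define dotted "a.b")
(irregex-match? (irregex dotted) "axb")       ; => #t

;; ... whereas sre->irregex treats the same string as a literal.
(irregex-match? (sre->irregex dotted) "axb")  ; => #f
(irregex-match? (sre->irregex dotted) "a.b")  ; => #t
</enscript>
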
 55==== string->sre
 56==== maybe-string->sre
 57
 58<procedure>(string->sre <str>)</procedure><br>
 59<procedure>(maybe-string->sre <obj>)</procedure><br>
 60
{{string->sre}} converts a POSIX string into an SRE.  It is provided
for backwards compatibility.
 63
 64{{maybe-string->sre}} does the same thing, but only if the argument is
 65a string, otherwise it assumes {{<obj>}} is an SRE and returns it
 66as-is.  This is useful when you want to provide an API that allows
 67either a POSIX string or SRE (like {{irregex}} or {{irregex-search}}
 68below) - it ensures the result is an SRE.
 69
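As a rough sketch of such an API, the hypothetical helper {{nocase-irregex}} below (not part of the module) accepts either kind of pattern and wraps it in {{w/nocase}} before compiling:

<enscript highlight=scheme>
(import (chicken irregex))

;; Hypothetical helper: accept a POSIX string or an SRE and compile a
;; case-insensitive version of it.
(define (nocase-irregex pat)
  (irregex `(w/nocase ,(maybe-string->sre pat))))

(irregex-match? (nocase-irregex "ab+") "ABBB")              ; => #t
(irregex-match? (nocase-irregex '(: "a" (+ "b"))) "ABBB")   ; => #t

;; An argument that is not a string is returned unchanged.
(maybe-string->sre '(+ "a"))                                ; => (+ "a")
</enscript>
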
 70==== glob->sre
 71
 72<procedure>(glob->sre <str>)</procedure>
 73
 74Converts a basic shell-style glob to an SRE which matches only strings
which the glob would match.  The glob characters {{[}}, {{]}}, {{*}}
and {{?}} are supported.
 77
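For instance, a glob can be converted once and then used like any other pattern.  The name {{scheme-file}} and the file names are illustrative only:

<enscript highlight=scheme>
(import (chicken irregex))

;; Compile the SRE produced from a shell-style glob.
(define scheme-file (irregex (glob->sre "*.scm")))

(irregex-match? scheme-file "irregex.scm")                     ; => #t
(irregex-match? scheme-file "irregex.txt")                     ; => #f

;; "?" stands for exactly one character.
(irregex-match? (glob->sre "chapter-?.txt") "chapter-1.txt")   ; => #t
</enscript>
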
 78
 79==== irregex?
 80
 81<procedure>(irregex? <obj>)</procedure>
 82
 83Returns {{#t}} iff the object is a regular expression.
 84
 85==== irregex-search
 86
 87<procedure>(irregex-search <irx> <str> [<start> <end>])</procedure>
 88
 89Searches for any instances of the pattern {{<irx>}} (a POSIX string, SRE
 90sexp, or pre-compiled regular expression) in {{<str>}}, optionally between
 91the given range.  If a match is found, returns a match object,
 92otherwise returns {{#f}}.
 93
 94Match objects can be used to query the original range of the string or
 95its submatches using the {{irregex-match-*}} procedures below.
 96
 97Examples:
 98
 99<enscript highlight=scheme>
100(irregex-search "foobar" "abcFOOBARdef") => #f
101
102(irregex-search (irregex "foobar" 'i) "abcFOOBARdef") => #<match>
103
104(irregex-search '(w/nocase "foobar") "abcFOOBARdef") => #<match>
105</enscript>
106
107Note, the actual match result is represented by a vector in the
108default implementation.  Throughout this manual, we'll just write
109{{#<match>}} to show that a successful match was returned when the
110details are not important.
111
112Matching follows the POSIX leftmost, longest semantics, when
113searching.  That is, of all possible matches in the string,
114{{irregex-search}} will return the match at the first position
115(leftmost).  If multiple matches are possible from that same first
116position, the longest match is returned.
117
118==== irregex-match
119==== irregex-match?
120
<procedure>(irregex-match <irx> <str> [<start> <end>])</procedure><br>
<procedure>(irregex-match? <irx> <str> [<start> <end>])</procedure>
123
124Like {{irregex-search}}, but performs an anchored match against the
125beginning and end of the substring specified by {{<start>}} and
126{{<end>}}, without searching.
127
128Where {{irregex-match}} returns a match object, {{irregex-match?}}
129just returns a boolean indicating whether it matched or not.
130
131Examples:
132
133<enscript highlight=scheme>
134(irregex-match '(w/nocase "foobar") "abcFOOBARdef") => #f
135
136(irregex-match '(w/nocase "foobar") "FOOBAR") => #<match>
137</enscript>
138
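The optional {{<start>}} and {{<end>}} arguments restrict the anchored match to a substring.  A small sketch with an arbitrary sample string:

<enscript highlight=scheme>
(import (chicken irregex))

;; The whole string must match, so the surrounding letters cause a failure.
(irregex-match? '(+ num) "abc123def")        ; => #f

;; Only the characters from index 3 up to (but excluding) 6 are examined.
(irregex-match? '(+ num) "abc123def" 3 6)    ; => #t

(irregex-match-substring
 (irregex-match '(+ num) "abc123def" 3 6))   ; => "123"
</enscript>
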
139==== irregex-match-data?
140
141<procedure>(irregex-match-data? <obj>)</procedure>
142
143Returns {{#t}} iff the object is a successful match result from
144{{irregex-search}} or {{irregex-match}}.
145
146==== irregex-num-submatches
147==== irregex-match-num-submatches
148
149<procedure>(irregex-num-submatches <irx>)</procedure><br>
150<procedure>(irregex-match-num-submatches <match>)</procedure>
151
152Returns the number of numbered submatches that are defined in the
153irregex or match object.
154
155==== irregex-names
156==== irregex-match-names
157
158<procedure>(irregex-names <irx>)</procedure><br>
159<procedure>(irregex-match-names <match>)</procedure>
160
161Returns an association list of named submatches that are defined in
162the irregex or match object.  The {{car}} of each item in this list is
163the name of a submatch, the {{cdr}} of each item is the numerical
164submatch corresponding to this name.  If a named submatch occurs
165multiple times in the irregex, it will also occur multiple times in
166this list.
167
168==== irregex-match-valid-index?
169
170<procedure>(irregex-match-valid-index? <match> <index-or-name>)</procedure><br>
171
172Returns {{#t}} iff the {{index-or-name}} named submatch or index is
173defined in the {{match}} object.
174
175==== irregex-match-substring
176==== irregex-match-start-index
177==== irregex-match-end-index
178
179<procedure>(irregex-match-substring <match> [<index-or-name>])</procedure><br>
180<procedure>(irregex-match-start-index <match> [<index-or-name>])</procedure><br>
181<procedure>(irregex-match-end-index <match> [<index-or-name>])</procedure>
182
183Fetches the matched substring (or its start or end offset) at the
184given submatch index, or named submatch.  The entire match is index 0,
the first submatch is index 1, etc.  The default is index 0.
186
187Returns {{#f}} if the given submatch did not match the source string (can happen when you have the submatch inside an {{or}} alternative, for example).
188
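The accessors above can be illustrated as follows.  The pattern, the submatch names {{year}} and {{month}}, and the input string are invented for this sketch:

<enscript highlight=scheme>
(import (chicken irregex))

(define m
  (irregex-search '(: (=> year (= 4 num)) "-" (=> month (= 2 num)))
                  "date: 2024-05"))

(irregex-match-num-submatches m)       ; => 2
(irregex-match-substring m)            ; => "2024-05"  (index 0, the whole match)
(irregex-match-substring m 'year)      ; => "2024"     (by name)
(irregex-match-substring m 2)          ; => "05"       (by number)
(irregex-match-start-index m)          ; => 6
(irregex-match-end-index m)            ; => 13

;; A submatch inside an alternative that did not take part in the match
;; yields #f.
(define m2 (irregex-search '(or ($ "a") ($ "b")) "b"))
(irregex-match-substring m2 1)         ; => #f
(irregex-match-substring m2 2)         ; => "b"
</enscript>
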
189==== irregex-match-subchunk
190==== irregex-match-start-chunk
191==== irregex-match-end-chunk
192
<procedure>(irregex-match-subchunk <match> [<index-or-name>])</procedure><br>
<procedure>(irregex-match-start-chunk <match> [<index-or-name>])</procedure><br>
<procedure>(irregex-match-end-chunk <match> [<index-or-name>])</procedure>
196
197Access the chunks delimiting the submatch index, or named submatch.
198
199{{irregex-match-subchunk}} generates a chunked data-type for the given
200match item, of the same type as the underlying chunk type (see Chunked
201String Matching below).  This is only available if the chunk type
202specifies the get-subchunk API, otherwise an error is raised.
203
204Returns {{#f}} if the given submatch did not match the source string (can happen when you have the submatch inside an {{or}} alternative, for example).
205
206==== irregex-replace
207==== irregex-replace/all
208
209<procedure>(irregex-replace <irx> <str> [<replacements> ...])</procedure><br>
210<procedure>(irregex-replace/all <irx> <str> [<replacements> ...])</procedure>
211
212Matches a pattern in a string, and replaces it with a (possibly empty)
213list of substitutions.  Each {{<replacement>}} can be either a string
214literal, a numeric index, a symbol (as a named submatch), or a
215procedure which takes one argument (the match object) and returns a
216string.
217
218Examples:
219
220<enscript highlight=scheme>
221(irregex-replace "[aeiou]" "hello world" "*") => "h*llo world"
222
223(irregex-replace/all "[aeiou]" "hello world" "*") => "h*ll* w*rld"
224
225(irregex-replace/all '(* "foo ") "foo foo platter" "*") => "**p*l*a*t*t*e*r"
226
227(irregex-replace "(.)(.)" "ab" 2 1 "*")  => "ba*"
228
;; string-reverse comes from SRFI 13 (the srfi-13 egg in CHICKEN 5)
(irregex-replace "...bar" "xxfoobar" (lambda (m)
230              (string-reverse (irregex-match-substring m)))) => "xxraboof"
231
232(irregex-replace "(...)(bar)" "xxfoobar"  2 (lambda (m) 
233              (string-reverse (irregex-match-substring m 1)))) => "xxbaroof"
234</enscript>
235
236==== irregex-split
237==== irregex-extract
238
239<procedure>(irregex-split <irx> <str> [<start> <end>])</procedure><br>
240<procedure>(irregex-extract <irx> <str> [<start> <end>])</procedure>
241
242{{irregex-split}} splits the string {{<str>}} into substrings divided
243by the pattern in {{<irx>}}.  {{irregex-extract}} does the opposite,
244returning a list of each instance of the pattern matched disregarding
245the substrings in between.
246
Empty matches will result in subsequent single character strings in
248{{irregex-split}}, or empty strings in {{irregex-extract}}.
249
250<enscript highlight="scheme">
251(irregex-split "[aeiou]*" "foobarbaz") => '("f" "b" "r" "b" "z")
252
253(irregex-extract "[aeiou]*" "foobarbaz") => '("" "oo" "" "a" "" "" "a" "")
254</enscript>
255
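With patterns that cannot match the empty string the results are more conventional.  A short sketch with made-up inputs:

<enscript highlight=scheme>
(import (chicken irregex))

;; Split on a literal comma, or on runs of whitespace.
(irregex-split "," "a,b,c")                     ; => ("a" "b" "c")
(irregex-split '(+ space) "one two   three")    ; => ("one" "two" "three")

;; Extract every run of digits and drop everything in between.
(irregex-extract '(+ num) "a1b22c333")          ; => ("1" "22" "333")
</enscript>
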
256
257==== irregex-fold
258
259<procedure>(irregex-fold <irx> <kons> <knil> <str> [<finish> <start> <end>])</procedure>
260
261This performs a fold operation over every non-overlapping place
262{{<irx>}} occurs in the string {{str}}.
263
264The {{<kons>}} procedure takes the following signature:
265
266<enscript highlight=scheme>
267(<kons> <from-index> <match> <seed>)
268</enscript>
269
270where {{<from-index>}} is the index from where we started searching
271(initially {{<start>}} and thereafter the end index of the last
272match), {{<match>}} is the resulting match-data object, and {{<seed>}}
273is the accumulated fold result starting with {{<knil>}}.
274
The rationale for providing the {{<from-index>}} (which is not
provided in the SCSH {{regexp-fold}} utility) is that this
277information is useful (e.g. for extracting the unmatched portion of
278the string before the current match, as needed in
279{{irregex-replace/all}}), and not otherwise directly accessible.
280
281Note when the pattern matches an empty string, to avoid an infinite
282loop we continue from one char after the end of the match (as opposed
283to the end in the normal case).  The {{<from-index>}} passed to
the subsequent {{<kons>}} or {{<finish>}} still refers to
285the original previous match end, however, so {{irregex-split}}
286and {{irregex-replace/all}}, etc. do the right thing.
287
288The optional {{<finish>}} takes two arguments:
289
290<enscript highlight=scheme>
291(<finish> <from-index> <seed>)
292</enscript>
293
which similarly allows you to pick up the unmatched tail of the string,
295and defaults to just returning the {{<seed>}}.
296
297{{<start>}} and {{<end>}} are numeric indices letting you specify the
298boundaries of the string on which you want to fold.
299
300To extract all instances of a match out of a string, you can use
301
302<enscript highlight=scheme>
303(map irregex-match-substring
304     (irregex-fold <irx>
305                   (lambda (i m s) (cons m s))
                   '()
                   <str>
                   (lambda (i s) (reverse s))))
309</enscript>
310
311Note if an empty match is found {{<kons>}} will be called on that
312empty string, and to avoid an infinite loop matching will resume at
313the next char.  It is up to the programmer to do something sensible
314with the skipped char in this case.
315
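The roles of {{<from-index>}} and {{<finish>}} can be sketched with two small helpers.  Both {{count-matches}} and {{unmatched-pieces}} are hypothetical names introduced for this example, and the second assumes a pattern that never matches the empty string:

<enscript highlight=scheme>
(import (chicken irregex))

;; Hypothetical helper: count the non-overlapping matches of irx in str.
(define (count-matches irx str)
  (irregex-fold irx (lambda (i m count) (+ count 1)) 0 str))

(count-matches '(+ num) "a1b22c333")    ; => 3

;; Hypothetical helper: collect the text between matches.  The from-index i
;; marks where the previous match ended, so (substring str i match-start)
;; is the unmatched text before the current match; finish picks up the tail.
(define (unmatched-pieces irx str)
  (irregex-fold irx
                (lambda (i m acc)
                  (cons (substring str i (irregex-match-start-index m)) acc))
                '()
                str
                (lambda (i acc)
                  (reverse (cons (substring str i (string-length str)) acc)))))

(unmatched-pieces "," "a,b,c")          ; => ("a" "b" "c")
</enscript>
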
316
317=== Extended SRE Syntax
318
319Irregex provides the first native implementation of SREs (Scheme
320Regular Expressions), and includes many extensions necessary both for
321minimal POSIX compatibility, as well as for modern extensions found in
322libraries such as PCRE.
323
324The following table summarizes the SRE syntax, with detailed
325explanations following.
326
327  ;; basic patterns
328  <string>                          ; literal string
329  (seq <sre> ...)                   ; sequence
330  (: <sre> ...)
331  (or <sre> ...)                    ; alternation
332  
333  ;; optional/multiple patterns
334  (? <sre> ...)                     ; 0 or 1 matches
335  (* <sre> ...)                     ; 0 or more matches
336  (+ <sre> ...)                     ; 1 or more matches
337  (= <n> <sre> ...)                 ; exactly <n> matches
338  (>= <n> <sre> ...)                ; <n> or more matches
  (** <from> <to> <sre> ...)        ; <from> to <to> matches
  (?? <sre> ...)                    ; non-greedy optional pattern (0 or 1)
341  (*? <sre> ...)                    ; non-greedy kleene star
342  (**? <from> <to> <sre> ...)       ; non-greedy range
343  
344  ;; submatch patterns
345  (submatch <sre> ...)              ; numbered submatch
346  ($ <sre> ...)
347  (submatch-named <name> <sre> ...) ; named submatch
348  (=> <name> <sre> ...)
349  (backref <n-or-name>)             ; match a previous submatch
350  
351  ;; toggling case-sensitivity
352  (w/case <sre> ...)                ; enclosed <sre>s are case-sensitive
353  (w/nocase <sre> ...)              ; enclosed <sre>s are case-insensitive
354  
355  ;; character sets
356  <char>                            ; singleton char set
357  (<string>)                        ; set of chars
358  (or <cset-sre> ...)               ; set union
359  (~ <cset-sre> ...)                ; set complement (i.e. [^...])
360  (- <cset-sre> ...)                ; set difference
361  (& <cset-sre> ...)                ; set intersection
362  (/ <range-spec> ...)              ; pairs of chars as ranges
363  
364  ;; named character sets
365  any
366  nonl
367  ascii
368  lower-case     lower
369  upper-case     upper
370  alphabetic     alpha
371  numeric        num
372  alphanumeric   alphanum  alnum
373  punctuation    punct
374  graphic        graph
375  whitespace     white     space
376  printing       print
377  control        cntrl
378  hex-digit      xdigit
379  
380  ;; assertions and conditionals
381  bos eos                           ; beginning/end of string
382  bol eol                           ; beginning/end of line
383  bow eow                           ; beginning/end of word
384  nwb                               ; non-word-boundary
385  (look-ahead <sre> ...)            ; zero-width look-ahead assertion
386  (look-behind <sre> ...)           ; zero-width look-behind assertion
387  (neg-look-ahead <sre> ...)        ; zero-width negative look-ahead assertion
388  (neg-look-behind <sre> ...)       ; zero-width negative look-behind assertion
389  (atomic <sre> ...)                ; for (?>...) independent patterns
390  (if <test> <pass> [<fail>])       ; conditional patterns
391  commit                            ; don't backtrack beyond this (i.e. cut)
392  
393  ;; backwards compatibility
394  (posix-string <string>)           ; embed a POSIX string literal
395
396==== Basic SRE Patterns
397
398The simplest SRE is a literal string, which matches that string
399exactly.
400
401<enscript highlight=scheme>
402(irregex-search "needle" "hayneedlehay") => #<match>
403</enscript>
404
405By default the match is case-sensitive, though you can control this
406either with the compiler flags or local overrides:
407
408<enscript highlight=scheme>
409(irregex-search "needle" "haynEEdlehay") => #f
410
411(irregex-search (irregex "needle" 'i) "haynEEdlehay") => #<match>
412
413(irregex-search '(w/nocase "needle") "haynEEdlehay") => #<match>
414</enscript>
415
416You can use {{w/case}} to switch back to case-sensitivity inside a
417{{w/nocase}} or when the SRE was compiled with {{'i}}:
418
419<enscript highlight=scheme>
420(irregex-search '(w/nocase "SMALL" (w/case "BIG")) "smallBIGsmall") => #<match>
421
422(irregex-search '(w/nocase "small" (w/case "big")) "smallBIGsmall") => #f
423</enscript>
424
''Important:'' characters outside the ASCII range (i.e., UTF-8 characters) are
426'''not''' matched case insensitively!
427
428Of course, literal strings by themselves aren't very interesting
429regular expressions, so we want to be able to compose them.  The most
430basic way to do this is with the {{seq}} operator (or its abbreviation
431{{:}}), which matches one or more patterns consecutively:
432
433<enscript highlight=scheme>
434(irregex-search '(: "one" space "two" space "three") "one two three") => #<match>
435</enscript>
436
437As you may have noticed above, the {{w/case}} and {{w/nocase}}
438operators allowed multiple SREs in a sequence - other operators that
439take any number of arguments (e.g. the repetition operators below)
440allow such implicit sequences.
441
442To match any one of a set of patterns use the {{or}} alternation
443operator:
444
445<enscript highlight=scheme>
446(irregex-search '(or "eeney" "meeney" "miney") "meeney") => #<match>
447
448(irregex-search '(or "eeney" "meeney" "miney") "moe") => #f
449</enscript>
450
451==== SRE Repetition Patterns
452
453There are also several ways to control the number of times a pattern
454is matched.  The simplest of these is {{?}} which just optionally
455matches the pattern:
456
457<enscript highlight=scheme>
458(irregex-search '(: "match" (? "es") "!") "matches!") => #<match>
459
460(irregex-search '(: "match" (? "es") "!") "match!") => #<match>
461
462(irregex-search '(: "match" (? "es") "!") "matche!") => #f
463</enscript>
464
465To optionally match any number of times, use {{*}}, the Kleene star:
466
467<enscript highlight=scheme>
468(irregex-search '(: "<" (* (~ #\>)) ">") "<html>") => #<match>
469
470(irregex-search '(: "<" (* (~ #\>)) ">") "<>") => #<match>
471
472(irregex-search '(: "<" (* (~ #\>)) ">") "<html") => #f
473</enscript>
474
475Often you want to match any number of times, but at least one time is
476required, and for that you use {{+}}:
477
478<enscript highlight=scheme>
479(irregex-search '(: "<" (+ (~ #\>)) ">") "<html>") => #<match>
480
481(irregex-search '(: "<" (+ (~ #\>)) ">") "<a>") => #<match>
482
483(irregex-search '(: "<" (+ (~ #\>)) ">") "<>") => #f
484</enscript>
485
486More generally, to match at least a given number of times, use {{>=}}:
487
488<enscript highlight=scheme>
489(irregex-search '(: "<" (>= 3 (~ #\>)) ">") "<table>") => #<match>
490
491(irregex-search '(: "<" (>= 3 (~ #\>)) ">") "<pre>") => #<match>
492
493(irregex-search '(: "<" (>= 3 (~ #\>)) ">") "<tr>") => #f
494</enscript>
495
496To match a specific number of times exactly, use {{=}}:
497
498<enscript highlight=scheme>
499(irregex-search '(: "<" (= 4 (~ #\>)) ">") "<html>") => #<match>
500
501(irregex-search '(: "<" (= 4 (~ #\>)) ">") "<table>") => #f
502</enscript>
503
504And finally, the most general form is {{**}} which specifies a range
505of times to match.  All of the earlier forms are special cases of this.
506
507<enscript highlight=scheme>
508(irregex-search '(: (= 3 (** 1 3 numeric) ".") (** 1 3 numeric)) "192.168.1.10") => #<match>
509
510(irregex-search '(: (= 3 (** 1 3 numeric) ".") (** 1 3 numeric)) "192.0168.1.10") => #f
511</enscript>
512
There are also so-called "non-greedy" variants of these repetition
operators, by convention suffixed with an additional {{?}}.  Since the
normal repetition patterns can match any number of times within the
allotted range, these operators will match a string if and only if the
normal versions matched.  However, a non-greedy repetition can change
''where'' the match and its submatches begin and end.  This affects
every {{irregex-search}} result, since the endpoints of the overall
match itself matter there.
521
522So, whereas {{?}} can be thought to mean "match or don't match,"
523{{??}} means "don't match or match."  {{*}} typically consumes as much
524as possible, but {{*?}} tries first to match zero times, and only
525consumes one at a time if that fails.  If you have a greedy operator
526followed by a non-greedy operator in the same pattern, they can
produce surprising results as they compete to make the match longer or
528shorter.  If this seems confusing, that's because it is.  Non-greedy
529repetitions are defined only in terms of the specific backtracking
530algorithm used to implement them, which for compatibility purposes
531always means the Perl algorithm.  Thus, when using these patterns you
532force IrRegex to use a backtracking engine, and can't rely on
533efficient execution.
534
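The difference only shows up in the extent of the match.  A minimal sketch with an arbitrary input string:

<enscript highlight=scheme>
(import (chicken irregex))

;; Greedy: (* any) consumes as much as possible before the final ">".
(irregex-match-substring
 (irregex-search '(: "<" (* any) ">") "<a><b>"))    ; => "<a><b>"

;; Non-greedy: (*? any) stops at the first ">" that lets the match succeed.
(irregex-match-substring
 (irregex-search '(: "<" (*? any) ">") "<a><b>"))   ; => "<a>"
</enscript>
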
535==== SRE Character Sets
536
537Perhaps more common than matching specific strings is matching any of
538a set of characters.  You can use the {{or}} alternation pattern on a
539list of single-character strings to simulate a character set, but this
540is too clumsy for everyday use so SRE syntax allows a number of
541shortcuts.
542
543A single character matches that character literally, a trivial
544character class.  More conveniently, a list holding a single element
545which is a string refers to the character set composed of every
546character in the string.
547
548<enscript highlight=scheme>
549(irregex-match '(* #\-) "---") => #<match>
550
551(irregex-match '(* #\-) "-_-") => #f
552
553(irregex-match '(* ("aeiou")) "oui") => #<match>
554
555(irregex-match '(* ("aeiou")) "ouais") => #f
556</enscript>
557
558Ranges are introduced with the {{/}} operator.  Any strings or
559characters in the {{/}} are flattened and then taken in pairs to
560represent the start and end points, inclusive, of character ranges.
561
562<enscript highlight=scheme>
563(irregex-match '(* (/ "AZ09")) "R2D2") => #<match>
564
565(irregex-match '(* (/ "AZ09")) "C-3PO") => #f
566</enscript>
567
568In addition, a number of set algebra operations are provided.  {{or}},
569of course, has the same meaning, but when all the options are
570character sets it can be thought of as the set union operator.  This
571is further extended by the {{&}} set intersection, {{-}} set
572difference, and {{~}} set complement operators.
573
574<enscript highlight=scheme>
575(irregex-match '(* (& (/ "az") (~ ("aeiou")))) "xyzzy") => #<match>
576
577(irregex-match '(* (& (/ "az") (~ ("aeiou")))) "vowels") => #f
578
579(irregex-match '(* (- (/ "az") ("aeiou"))) "xyzzy") => #<match>
580
581(irregex-match '(* (- (/ "az") ("aeiou"))) "vowels") => #f
582</enscript>
583
584==== SRE Assertion Patterns
585
586There are a number of times it can be useful to assert something about
587the area around a pattern without explicitly making it part of the
588pattern.  The most common cases are specifically anchoring some
589pattern to the beginning or end of a word or line or even the whole
590string.  For example, to match on the end of a word:
591
592<enscript highlight=scheme>
593(irregex-search '(: "foo" eow) "foo") => #<match>
594
595(irregex-search '(: "foo" eow) "foo!") => #<match>
596
597(irregex-search '(: "foo" eow) "foof") => #f
598</enscript>
599
The {{bow}}, {{bol}}, {{eol}}, {{bos}} and {{eos}} assertions work similarly.
{{nwb}} asserts that you are not at a word boundary - if substituted for
{{eow}} in the above examples it would reverse all the results.
603
604There is no {{wb}}, since you tend to know from context whether it
605would be the beginning or end of a word, but if you need it you can
606always use {{(or bow eow)}}.
607
608Somewhat more generally, Perl introduced positive and negative
609look-ahead and look-behind patterns.  Perl look-behind patterns are
610limited to a fixed length, however the IrRegex versions have no such
611limit.
612
613<enscript highlight=scheme>
614(irregex-search '(: "regular" (look-ahead " expression"))
615                "regular expression")
616 => #<match>
617</enscript>
618
619The most general case, of course, would be an {{and}} pattern to
620complement the {{or}} pattern - all the patterns must match or the
621whole pattern fails.  This may be provided in a future release,
622although it (and look-ahead and look-behind assertions) are unlikely
623to be compiled efficiently.
624
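Assertions are zero-width: they constrain the match without becoming part of it.  A brief sketch using the strings from the example above:

<enscript highlight=scheme>
(import (chicken irregex))

;; The look-ahead must succeed, but its text is not included in the match.
(irregex-match-substring
 (irregex-search '(: "regular" (look-ahead " expression"))
                 "regular expression"))               ; => "regular"

;; The negative form fails exactly where the positive one succeeds.
(irregex-search '(: "regular" (neg-look-ahead " expression"))
                "regular expression")                 ; => #f

;; bos anchors a search to the beginning of the string.
(irregex-match-substring
 (irregex-search '(: bos "foo") "foobar"))            ; => "foo"
(irregex-search '(: bos "bar") "foobar")              ; => #f
</enscript>
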
625==== SRE Utility Patterns
626
627The following utility regular expressions are also provided for common
628patterns that people are eternally reinventing.  They are not
629necessarily the official patterns matching the RFC definitions of the
630given data, because of the way that such patterns tend to be used.
631There are three general usages for regexps:
632
633; searching : search for a pattern matching a desired object in a larger text
634
635; validation : determine whether an entire string matches a pattern
636
637; extraction : given a string already known to be valid, extract certain fields from it as submatches
638
639In some cases, but not always, these will overlap.  When they are
640different, {{irregex-search}} will naturally always want the searching
641version, so IrRegex provides that version.
642
643As an example where these might be different, consider a URL.  If you
644want to match all the URLs in some arbitrary text, you probably want
645to exclude a period or comma at the tail end of a URL, since it's more
646likely being used as punctuation rather than part of the URL, despite
647the fact that it would be valid URL syntax.
648
649Another problem with the RFC definitions is the standard itself may
650have become irrelevant.  For example, the pattern IrRegex provides for
651email addresses doesn't match quoted local parts (e.g.
652{{"first last"@domain.com}}) because these are increasingly rare, and
653unsupported by enough software that it's better to discourage their use.
654Conversely, technically consecutive periods
655(e.g. {{first..last@domain.com}}) are not allowed in email addresses, but
656most email software does allow this, and in fact such addresses are
657quite common in Japan.
658
659The current patterns provided are:
660
661  newline                        ; general newline pattern (crlf, cr, lf)
662  integer                        ; an integer
663  real                           ; a real number (including scientific)
664  string                         ; a "quoted" string
665  symbol                         ; an R5RS Scheme symbol
666  ipv4-address                   ; a numeric decimal ipv4 address
667  ipv6-address                   ; a numeric hexadecimal ipv6 address
668  domain                         ; a domain name
669  email                          ; an email address
670  http-url                       ; a URL beginning with https?://
671
672Because of these issues the exact definitions of these patterns are
subject to change, but will be documented clearly when they are
674finalized.  More common patterns are also planned, but as what you
675want increases in complexity it's probably better to use a real
676parser.
677
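The utility patterns are ordinary SRE symbols, so they can be used alone or inside larger SREs.  Because their exact definitions may change, the sketch below only notes the results one would expect for simple inputs; the sample strings are invented:

<enscript highlight=scheme>
(import (chicken irregex))

;; Validation: a plain dotted-quad address should be accepted.
(irregex-match? 'ipv4-address "192.168.1.10")

;; Searching: pull the integers out of a larger text.
(irregex-extract 'integer "3 apples and 12 pears")

;; Utility patterns compose with other SRE forms like any named set.
(irregex-search '(: "port " ($ integer)) "listening on port 8080")
</enscript>
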
678=== Supported PCRE Syntax
679
Since the PCRE syntax is so overwhelmingly complex, it's easier to just
list what we ''don't'' support for now.  Refer to the
682[[http://pcre.org/pcre.txt|PCRE documentation]] for details.  You
683should be using the SRE syntax anyway!
684
685Unicode character classes ({{\P}}) are not supported, but will be
686in an upcoming release.  {{\C}} named characters are not supported.
687
688Callbacks, subroutine patterns and recursive patterns are not
689supported.  ({{*FOO}}) patterns are not supported and may never be.
690
691{{\G}} and {{\K}} are not supported.
692
693Octal character escapes are not supported because they are ambiguous
694with back-references - just use hex character escapes.
695
696Other than that everything should work, including named submatches,
697zero-width assertions, conditional patterns, etc.
698
699In addition, {{\<}} and {{\>}} act as beginning-of-word and end-of-word
700marks, respectively, as in Emacs regular expressions.
701
702Also, two escapes are provided to embed SRE patterns inside PCRE
703strings, {{"\'<sre>"}} and {{"(*'<sre>)"}}.  For example, to match a
704comma-delimited list of integers you could use
705
706<enscript highlight=scheme>
707"\\'integer(,\\'integer)*"
708</enscript>
709
710and to match a URL in angle brackets you could use
711
<enscript highlight=scheme>
"<(*'http-url)>"
</enscript>

Note in the second example the enclosing {{"(*'...)"}} syntax is needed
because the Scheme reader would otherwise consider the closing {{">"}} as part of
the SRE symbol.
719
720The following chart gives a quick reference from PCRE form to the SRE
721equivalent:
722
723  ;; basic syntax
  "^"                     ;; bos (or bol inside (?m: ...))
  "$"                     ;; eos (or eol inside (?m: ...))
726  "."                     ;; nonl
727  "a?"                    ;; (? a)
728  "a*"                    ;; (* a)
729  "a+"                    ;; (+ a)
730  "a??"                   ;; (?? a)
731  "a*?"                   ;; (*? a)
732  "a+?"                   ;; (+? a)
733  "a{n,m}"                ;; (** n m a)
734
735  ;; grouping
736  "(...)"                 ;; (submatch ...)
737  "(?:...)"               ;; (: ...)
738  "(?i:...)"              ;; (w/nocase ...)
739  "(?-i:...)"             ;; (w/case ...)
  "(?<name>...)"          ;; (=> <name> ...)
741
742  ;; character classes
743  "[aeiou]"               ;; ("aeiou")
  "[^aeiou]"              ;; (~ ("aeiou"))
745  "[a-z]"                 ;; (/ "az") or (/ "a" "z")
746  "[[:alpha:]]"           ;; alpha
747
748  ;; assertions
749  "(?=...)"               ;; (look-ahead ...)
750  "(?!...)"               ;; (neg-look-ahead ...)
751  "(?<=...)"              ;; (look-behind ...)
752  "(?<!...)"              ;; (neg-look-behind ...)
753  "(?(test)pass|fail)"    ;; (if test pass fail)
754  "(*COMMIT)"             ;; commit
755
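To illustrate the chart, the same pattern can be written in either notation.  The names {{pcre-rx}} and {{sre-rx}} and the input string are chosen for this sketch only:

<enscript highlight=scheme>
(import (chicken irregex))

;; The same pattern as a PCRE-style string and as an SRE.
(define pcre-rx (irregex "(?i:foo)[0-9]+"))
(define sre-rx  (irregex '(: (w/nocase "foo") (+ (/ "09")))))

(irregex-match-substring (irregex-search pcre-rx "xFOO42y"))   ; => "FOO42"
(irregex-match-substring (irregex-search sre-rx  "xFOO42y"))   ; => "FOO42"
</enscript>
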
756=== Chunked String Matching
757
758It's often desirable to perform regular expression matching over
759sequences of characters not represented as a single string.  The most
760obvious example is a text-buffer data structure, but you may also want
761to match over lists or trees of strings (i.e. ropes), over only
762certain ranges within a string, over an input port, etc.  With
763existing regular expression libraries, the only way to accomplish this
764is by converting the abstract sequence into a freshly allocated
765string.  This can be expensive, or even impossible if the object is a
766text-buffer opened onto a 500MB file.
767
768IrRegex provides a chunked string API specifically for this purpose.
769You define a chunking API with {{make-irregex-chunker}}:
770
771==== make-irregex-chunker
772
773<procedure>(make-irregex-chunker <get-next> <get-string> [<get-start> <get-end> <get-substring> <get-subchunk>])</procedure>
774
775where 
776
777{{(<get-next> chunk) => }} returns the next chunk, or {{#f}} if there are no more chunks
778
779{{(<get-string> chunk) => }} a string source for the chunk
780
781{{(<get-start> chunk) => }} the start index of the result of {{<get-string>}} (defaults to always 0)
782
783{{(<get-end> chunk) => }} the end (exclusive) of the string (defaults to {{string-length}} of the source string)
784
785{{(<get-substring> cnk1 i cnk2 j) => }} a substring for the range between the chunk {{cnk1}} starting at index {{i}} and ending at {{cnk2}} at index {{j}}
786
787{{(<get-subchunk> cnk1 i cnk2 j) => }} as above but returns a new chunked data type instead of a string (optional)
788
789There are two important constraints on the {{<get-next>}} procedure.
790It must return an {{eq?}} identical object when called multiple times
791on the same chunk, and it must not return a chunk with an empty string
792(start == end).  This second constraint is for performance reasons -
793we push the work of possibly filtering empty chunks to the chunker
794since there are many chunk types for which empty strings aren't
795possible, and this work is thus not needed.  Note that the initial
796chunk passed to match on is allowed to be empty.
797
798{{<get-substring>}} is provided for possible performance improvements
799- without it a default is used.  {{<get-subchunk>}} is optional -
800without it you may not use {{irregex-match-subchunk}} described above.
801
802You can then match chunks of these types with the following
803procedures:
804
805==== irregex-search/chunked
806==== irregex-match/chunked
807
808<procedure>(irregex-search/chunked <irx> <chunker> <chunk> [<start>])</procedure><br>
809<procedure>(irregex-match/chunked <irx> <chunker> <chunk> [<start>])</procedure>
810
811These return normal match-data objects.
812
813Example:
814
815To match against a simple, flat list of strings use:
816
817<enscript highlight=scheme>
818  (define (rope->string rope1 start rope2 end)
819    (if (eq? rope1 rope2)
820        (substring (car rope1) start end)
821        (let loop ((rope (cdr rope1))
822                   (res (list (substring (car rope1) start))))
823           (if (eq? rope rope2)
824               (string-concatenate-reverse      ; from SRFI-13
825                (cons (substring (car rope) 0 end) res))
826               (loop (cdr rope) (cons (car rope) res))))))
827
828  (define rope-chunker
829    (make-irregex-chunker (lambda (x) (and (pair? (cdr x)) (cdr x)))
830                          car
831                          (lambda (x) 0)
832                          (lambda (x) (string-length (car x)))
833                          rope->string))
834
835  (irregex-search/chunked <pat> rope-chunker <list-of-strings>)
836</enscript>
837
838Here we are just using the default start, end and substring behaviors,
839so the above chunker could simply be defined as:
840
841<enscript highlight=scheme>
842  (define rope-chunker
843    (make-irregex-chunker (lambda (x) (and (pair? (cdr x)) (cdr x))) car))
844</enscript>
845
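A concrete run of the simplified chunker might look like this.  The list of strings is arbitrary; note that none of the chunks is empty, as required above:

<enscript highlight=scheme>
(import (chicken irregex))

;; Each chunk is a pair from the list; the next chunk is its cdr, or #f at
;; the end.  cdr always returns the same pair, satisfying the eq? constraint.
(define rope-chunker
  (make-irregex-chunker (lambda (x) (and (pair? (cdr x)) (cdr x))) car))

;; "needle" straddles the second and third strings.
(define m
  (irregex-search/chunked "needle" rope-chunker '("hay" "nee" "dle" "hay")))

(irregex-match-substring m)   ; => "needle"
</enscript>
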
846==== irregex-fold/chunked
847
848<procedure>(irregex-fold/chunked <irx> <kons> <knil> <chunker> <chunk> [<finish> [<start-index>]])</procedure>
849
850Chunked version of {{irregex-fold}}.
851
852=== Utilities
853
854The following procedures are also available.
855
856==== irregex-quote
857
858<procedure>(irregex-quote <str>)</procedure>
859
860Returns a new string with any special regular expression characters
861escaped, to match the original string literally in POSIX regular
862expressions.
863
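Quoting matters whenever user-supplied text is used as a POSIX pattern.  A minimal sketch with an invented input:

<enscript highlight=scheme>
(import (chicken irregex))

;; Unquoted, "+" is a repetition operator: the pattern means "one or more
;; 1s followed by a 1", which does not occur in the string.
(irregex-search "1+1" "1+1=2")                    ; => #f

;; Quoted, every character is matched literally.
(irregex-quote "1+1")                             ; => "1\\+1"
(irregex-match-substring
 (irregex-search (irregex-quote "1+1") "1+1=2"))  ; => "1+1"
</enscript>
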
864==== irregex-opt
865
866<procedure>(irregex-opt <list-of-strings>)</procedure>
867
868Returns an optimized SRE matching any of the literal strings
869in the list, like Emacs' {{regexp-opt}}.  Note this optimization
870doesn't help when irregex is able to build a DFA.
871
872==== sre->string
873
874<procedure>(sre->string <sre>)</procedure>
875
876Convert an SRE to a PCRE-style regular expression string, if
877possible.
878
879
880---
881Previous: [[Module (chicken io)]]
882
883Next: [[Module (chicken keyword)]]